Unfortunately, we could not get our linear regression code to work in dashboard format. Please see our reflection below for the regression analysis and our dashboard .rmd file for our attempt.
As we have discussed in class, many urban problems exacerbate the inequalities within an urban system. After over a year and a half of the pandemic, Covid-19 still poses a threat to the population’s ability to stay healthy, travel, work, and gather, though some are more affected than others. It is no secret that Covid-19 is disproportionately affecting communities of color and impoverished communities, and these two groups overlap considerably. Oakland is a community of interest for examining the impact of Covid-19 on vulnerable populations because it contains zip codes with large differences in poverty levels as well as many racially segregated neighborhoods. It is also a high-impact city to study because of its large and diverse population. With over 433,000 residents, the city has a relatively equal split between White, Black, and Hispanic/Latino populations, with a strong Asian presence as well (source 4). In addition, the poverty rate is 16.7%, which is more than 5 percentage points higher than the national average.
Given Oakland’s demographics, Covid-19 has affected its populations disproportionately. One example is seen by comparing zip codes 94603 and 94618, which are on opposite sides of Oakland. Zip code 94603 has 30% of children living below the poverty level and the highest Covid rate in the city, whereas zip code 94618 has 4% of children living below the poverty level and the lowest Covid rate in the city (source 1). Another key insight from this example is that zip code 94618 is 77% White while zip code 94603 is majority Black or African American (source 2).
Furthermore, at the City of Oakland Racial Disparities Task Force Town Hall it was mentioned that in Alameda County, Latinx residents make up 22% of the general population and 46% of the COVID-19 caseload, and African-Americans make up 10% of the general population and 14% of COVID-19 cases (source 3). These startling statistics indicate a need to prioritize research on the impact of Covid-19 on the basis of both race and poverty. Local leaders in Oakland have announced how extremely worried they are about these disparities affecting people of color as well as low-income people, immigrants, people with disabilities, and others (source 5). Many other factors could be indicative of Covid-19 impact, risk, and response, so we were also curious to see how the numbers relate to covid testing rates, since testing can alert community members to infection and allow them to take the necessary precautions to avoid further spread. We plan to find relationships between these variables and to map them geographically to provide some insights on Covid-19 in Oakland as well as to predict future Covid-19 rates.
For this reason, we decided to use Census Data for income and race and ArcGIS Hub data that records Alameda County COVID-19 Cases and Case Rates over the past 28 Days by Zip Code (https://hub.arcgis.com/datasets/5d6bf4760af64db48b6d053e7569a47b_0/explore?location=37.679493%2C-121.905640%2C10.88, https://hub.arcgis.com/datasets/5d6bf4760af64db48b6d053e7569a47b/explore?layer=4&location=37.679103%2C-121.905640%2C10.88 ).
Below is a map of Alameda County (highlighted in blue). Oakland, our area of interest, is shown with its zip codes outlined in red.
Our ultimate goal was to create a dashboard that makes it easy to view Oakland zip codes in terms of the presence of a chosen income level, race, COVID testing rate, and COVID case rate. The key questions we hoped to answer were 1) how the amount of COVID testing correlates with covid cases (i.e., does testing actually have as strong a relationship with the covid case rate as many say), 2) whether access to testing appears equal across racial and economic backgrounds (i.e., what does accessibility look like?), and 3) what the disparities in Oakland look like more generally. One issue to note is that our data is only as granular as the zip code level, so the Oakland zip codes alone gave too few observations for a statistically meaningful regression. Our overall analysis of the trends seen in Oakland through our graphs therefore had to be supplemented with regression results for all of Alameda County. This does add a source of inconsistency, but we still felt we were able to draw informative conclusions.
Our reflections with specific instances of graphs we wanted to point out are presented below, but the link to our dashboard can be found at the top of this report.
COVID Dataset from Alameda County COVID-19 Case/Case Rates by Zip Code GeoJSON API URL: “https://opendata.arcgis.com/datasets/5d6bf4760af64db48b6d053e7569a47b_0.geojson”
COVID Dataset from Alameda County COVID-19 Test Rates by Zip Code GeoJSON API URL: “https://opendata.arcgis.com/datasets/5d6bf4760af64db48b6d053e7569a47b_4.geojson”
The covid case and testing rates are a running total of the past 28 days.
As seen in the visual above, income in Oakland is distributed disproportionately across race, especially at the two income extremes (i.e. less than $10,000 vs. $200,000 or more). The first thing to note is that the population breakdown in Oakland is fairly diverse: the White population makes up the largest proportion of households, but there are also large Black and Asian populations. Although White residents are close to 45% of Oakland’s population, they account for only about 20% of people making less than $10,000 and nearly 75% of people making more than $200,000. This shows how over-represented the White population is at the top of the income range. In addition, the Black population makes up roughly 25% of the total population but has the largest fraction of people making less than $10,000 (at about 45%) and the smallest fraction of people making more than $200,000 (at about 7%). The other racial groups follow a similar pattern to the Black population to a lesser degree, with the exception of “Two or More Races”, whose share in each bracket was fairly proportional to its population size.
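The comparison above (a group's share within an income bracket versus its share of all households) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `households` table, its column names, and every count in it are made up for the example and are not the Census figures behind our visual.

```r
# Hypothetical household counts by race and income bracket.
# These numbers are invented for illustration, not taken from the Census data.
households <- data.frame(
  race    = rep(c("White", "Black", "Asian"), each = 2),
  bracket = rep(c("Less than $10,000", "$200,000 or more"), times = 3),
  count   = c(200, 750, 450, 70, 350, 180),
  stringsAsFactors = FALSE
)

# Share of each race within each income bracket.
bracket_totals <- tapply(households$count, households$bracket, sum)
households$share_in_bracket <-
  households$count / as.numeric(bracket_totals[households$bracket])

# Share of each race across all households in the table.
race_totals <- tapply(households$count, households$race, sum)
population_share <- race_totals / sum(race_totals)
households$population_share <- as.numeric(population_share[households$race])

# A ratio above 1 means the group is over-represented in that bracket
# relative to its share of all households; below 1 means under-represented.
households$representation <-
  households$share_in_bracket / households$population_share

print(households)
```

Reading the `representation` column is the same comparison we make in prose: a bracket share well above or below the population share signals a disparity.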
Our first major assumption is that the breakdown of race in this visual is representative of the breakdown of population in Oakland. Another assumption is the validity, completeness, and accuracy of the data set used. This data set is gathered and produced by the US Census, so we are assuming it is from a credible source on the topic and was gathered in a fair and unbiased way.
The maps below echo the themes of the equity analysis. What these maps add is where these populations are located geographically, so we can make a general inspection relative to covid case rates and covid test rates. After these general observations, we performed regression analysis (see below).
Some key observations include: the areas with a large Black population are in Southwest Oakland, the areas with the largest White population are in North Oakland, and the areas with more lower-income people are in South Oakland. There is large overlap between areas with a higher Black population and areas with more lower-income people, which affirms the findings from the equity analysis.
Looking at the map of Covid test rates, the largest concentration of high test rates is in the northern part of Oakland. Comparing this with the race/income geography maps, the zip codes with the highest test rates are also the zip codes with a high White population and a high population of the highest-income people. The rest of Oakland has roughly uniform testing rates, which are a lot lower than the northern testing rates.
The map of Covid case rates is really interesting because there are large disparities in cases. Southern Oakland has several times more cases than Northern Oakland. The zip codes with the highest case rates are also the zip codes with a high Black population and a relatively lower-income population. The lowest covid case rates are in Northern Oakland, in the zip codes that have the highest test rates and are mostly White and high income.
These preliminary findings are further investigated below.
Before beginning our regression analysis, we noticed that several outliers were present in the relationship between COVID cases and COVID testing, and that they made the results less relevant. We therefore identified the zip codes behind these outliers and removed them to make the results more accurate.
Below is a plot of all the Alameda zipcode data:
##
## Call:
## lm(formula = CaseRates ~ TestRates, data = alameda_grouping_by_zip1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6048.2 -2469.0 -298.2 1963.1 11266.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.791e+03 5.748e+02 10.075 1e-13 ***
## TestRates 1.608e-01 5.636e-02 2.853 0.00624 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3800 on 51 degrees of freedom
## Multiple R-squared: 0.1376, Adjusted R-squared: 0.1207
## F-statistic: 8.139 on 1 and 51 DF, p-value: 0.006244
From this we can identify some zip codes with data points that deviate far from the rest of the data, specifically 94720, 95377, 94621, 94613 and 94603, which are significantly higher in terms of either test rate or case rate, so we will remove these rows from the regression to hopefully improve it.
We now see that the graph looks better.
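As a minimal sketch of this step, the filter-and-refit can be written as below. It is not our dashboard code: `zip_data`, the `Zip` column name, and all of the rate values are hypothetical stand-ins (the real table is `alameda_grouping_by_zip1`); only the five flagged zip codes come from the text.

```r
# Synthetic stand-in for the zip-level table. The Zip column name and every
# TestRates/CaseRates value here are invented for illustration.
zip_data <- data.frame(
  Zip       = c("94601", "94602", "94605", "94606", "94607", "94609",
                "94720", "95377", "94621", "94613", "94603"),
  TestRates = c(2500, 3100, 2800, 2600, 3300, 3000,
                40000, 1200, 2700, 30000, 2400),
  CaseRates = c(7000, 5200, 6400, 6900, 5000, 5600,
                9000, 15000, 16000, 8000, 17000),
  stringsAsFactors = FALSE
)

# Zip codes flagged as outliers in the text.
outlier_zips <- c("94720", "95377", "94621", "94613", "94603")

# Fit once on everything, then again after dropping the flagged rows.
fit_all     <- lm(CaseRates ~ TestRates, data = zip_data)
trimmed     <- subset(zip_data, !(Zip %in% outlier_zips))
fit_trimmed <- lm(CaseRates ~ TestRates, data = trimmed)

# Compare the slope before and after removing the flagged zip codes.
print(coef(fit_all)["TestRates"])
print(coef(fit_trimmed)["TestRates"])
```

Keeping the untrimmed fit next to the trimmed one makes it easy to report how much the conclusion depends on the removed rows.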
##
## Call:
## lm(formula = CaseRates ~ TestRates, data = alameda_grouping_by_zip1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5403.3 -2223.5 -18.9 1524.6 7662.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8675.3365 1187.6554 7.305 3.19e-09 ***
## TestRates -0.9212 0.3849 -2.393 0.0208 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2796 on 46 degrees of freedom
## Multiple R-squared: 0.1107, Adjusted R-squared: 0.09138
## F-statistic: 5.727 on 1 and 46 DF, p-value: 0.02085
As we can see, removing the outliers reversed the direction of the trend (the TestRates coefficient changed from about 0.16 to about -0.92), and so now we are ready for analysis. The linear regression appears to imply an association between Test Rates and Case Rates. More specifically, “An increase of Covid Testing Rates over the past 28 days by 1 unit is associated with a decrease of Covid Case Rates over the past 28 days by 0.92”. Furthermore, the p-value (0.0208) is less than .05, so the result seems statistically significant, although R-squared is only about 0.11 and the residuals are not symmetric around 0 (they range from -5403.3 to 7662.9), which calls some of the validity of this correlation into question.
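The quantities discussed here (slope, p-value, and the spread of the residuals) can be pulled out of a fitted model directly rather than read off the printed summary. The sketch below uses simulated data, not the Alameda County values; the object names `sim` and `fit` and the simulation parameters are assumptions made for the example.

```r
set.seed(42)

# Simulated zip-level data (NOT the Alameda County values).
n <- 48
TestRates <- runif(n, 1500, 5000)
CaseRates <- 8500 - 0.9 * TestRates + rnorm(n, sd = 2500)
sim <- data.frame(TestRates, CaseRates)

fit   <- lm(CaseRates ~ TestRates, data = sim)
coefs <- summary(fit)$coefficients

# Slope: change in CaseRates associated with a 1-unit change in TestRates.
slope   <- coefs["TestRates", "Estimate"]
p_value <- coefs["TestRates", "Pr(>|t|)"]

# With an intercept in the model, least-squares residuals always average to
# zero, so the median and quartiles are the more informative symmetry check.
res <- residuals(fit)

print(c(slope = slope, p_value = p_value))
print(mean(res))
print(quantile(res))
```

A residuals-versus-fitted plot (`plot(fit, which = 1)`) would be a natural follow-up check for skew or curvature.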
##
## Call:
## lm(formula = estimate ~ CaseRates, data = alameda_grouping_by_zip2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -461.0 -153.8 -56.0 109.5 1351.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.831244 43.117030 1.017 0.311
## CaseRates 0.053901 0.006467 8.335 1.53e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 260.1 on 190 degrees of freedom
## Multiple R-squared: 0.2677, Adjusted R-squared: 0.2639
## F-statistic: 69.47 on 1 and 190 DF, p-value: 1.531e-14
The linear regression appears to imply an association between the number of people making below $25,000 annually and Covid Case Rates over the past 28 days. Because the model is fit as estimate ~ CaseRates, the coefficient reads: “An increase of Covid case rates over the past 28 days by 1 unit is associated with an increase of about 0.05 in the number of people making below $25,000 annually.” However, while the p-value seems to indicate statistical significance, the residuals are not symmetric around 0 (the median is -56.0 and they range from -461.0 to 1351.7), signaling that this claim is not necessarily accurate and would require further exploration.
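Which variable sits on the left of the formula matters when reading the slope. In `estimate ~ CaseRates`, the coefficient is the change in `estimate` per unit of `CaseRates`, not the other way around. The sketch below, on simulated data with assumed parameters, fits both directions to show that the two slopes differ.

```r
set.seed(7)

# Simulated data (not the report's dataset); names mirror the model formula.
n <- 100
CaseRates <- runif(n, 2000, 12000)
estimate  <- 40 + 0.05 * CaseRates + rnorm(n, sd = 250)

# Slope here: change in estimate per 1-unit change in CaseRates.
fit_yx <- lm(estimate ~ CaseRates)
# Slope here: change in CaseRates per 1-unit change in estimate.
fit_xy <- lm(CaseRates ~ estimate)

b_yx <- unname(coef(fit_yx)["CaseRates"])
b_xy <- unname(coef(fit_xy)["estimate"])

# The two slopes are reciprocals only for a perfect fit; in general their
# product equals the R-squared of the simple regression.
r2 <- summary(fit_yx)$r.squared
print(c(b_yx = b_yx, b_xy = b_xy, product = b_yx * b_xy, r_squared = r2))
```

The p-value and R-squared are the same in both directions for a one-predictor model, so only the wording of the slope changes.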
##
## Call:
## lm(formula = estimate ~ TestRates, data = alameda_grouping_by_zip3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1461.3 -1102.5 -597.1 912.0 7146.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1650.3731 677.4604 2.436 0.0188 *
## TestRates -0.1104 0.2196 -0.503 0.6175
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1595 on 46 degrees of freedom
## Multiple R-squared: 0.005467, Adjusted R-squared: -0.01615
## F-statistic: 0.2529 on 1 and 46 DF, p-value: 0.6175
##
## Call:
## lm(formula = estimate ~ TestRates, data = alameda_grouping_by_zip4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6211.2 -2178.4 -718.8 1279.2 9396.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7793.0748 1483.9155 5.252 3.76e-06 ***
## TestRates -0.7014 0.4810 -1.458 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3494 on 46 degrees of freedom
## Multiple R-squared: 0.04419, Adjusted R-squared: 0.02341
## F-statistic: 2.127 on 1 and 46 DF, p-value: 0.1515
As we can see from the model summaries, in both cases the p-values (0.6175 and 0.152) are above .05, so the results are not statistically significant, which underlines some ambiguity about how much access to COVID tests is or isn’t a problem, at least in Alameda County. In the graph of Black population versus test rates, we see that while some zip codes with more Black residents had lower test rates, areas with fewer Black residents had both significantly more and significantly less covid testing. For the White population, the spread seems fairly even, with both highs and lows in test rates across varying population counts. However, there are many caveats to this. These results should not be taken to say that covid testing is or is not as accessible to a certain racial group, but rather that we need more data. Additionally, other factors unrelated to race could boost or lower the numbers, such as a given zip code simply having more or fewer people in it, which is a downside of our dataset providing raw counts rather than percentages.
Overall, it was nice to see that higher covid testing rates do appear to be associated with lower covid case rates, although the correlation did not seem particularly strong, and it was hard to draw conclusions about how racial and economic background relate to covid test accessibility and case rates. I think a large reason for this is that the association can only be made loosely, since we do not know which of the covid cases or tests came from people of which economic or racial groups. However, I still think there was knowledge to be gained from this project and its results: understanding the complexity, identifying areas where we should go deeper, and weighing the other concerns that come with trying to get clearer data, like anonymity (for example, it could be quite problematic for some people to report their racial or economic identity along with a covid test or case report in a publicly provided dataset).
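One way to address the raw-count caveat would be to convert counts into shares of each zip code's population before regressing. The sketch below illustrates the idea only; `zips`, `group_count`, `total_population`, and all values are hypothetical and do not come from our dataset.

```r
# Hypothetical counts per zip code (invented for illustration).
zips <- data.frame(
  Zip              = c("A", "B", "C"),
  group_count      = c(1200, 300, 4500),
  total_population = c(10000, 2000, 45000),
  stringsAsFactors = FALSE
)

# Share of the zip code's population that belongs to the group.
zips$group_share <- zips$group_count / zips$total_population

# Ranking by share can differ from ranking by raw count: the largest zip code
# has the most group members but not the highest share.
print(zips[order(-zips$group_share), ])
```

With a share column in place, a model such as `lm(group_share ~ TestRates)` would no longer be driven by zip codes that are simply larger.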
Sources:
1. https://calmatters.org/health/coronavirus/2021/06/california-covid-inequality-oakland-rockridge/
2. https://www.unitedstateszipcodes.org/
3. https://www.accfb.org/how-covid-19-is-affecting-communities-of-color/
4. https://www.census.gov/quickfacts/oaklandcitycalifornia
5. https://www.oaklandca.gov/news/2020/local-leaders-announce-covid-19-racial-disparities-task-force